
fixes bug where multiple nodes fight over an eip #435

Merged: 4 commits merged into main from fix-contention-issue-simple on Mar 27, 2024

Conversation

@jubrad (Contributor) commented Mar 15, 2024

- introduces resource_ids to eip status
- moves setting the attachment status to before the attempt at attaching
- moves the status update from patch to replace; this should raise an error if the resource has changed, allowing the resource_id to function somewhat as a lock (see the sketch below)
- a resource won't attempt to claim an eip that already has a resource_id
- fixes cargo deny errors
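
A minimal sketch of the replace-as-lock idea, using trimmed-down stand-ins for the operator's CRD types (the `Eip`/`EipStatus` shapes and the `claim_eip` helper are illustrative, not the PR's exact code):

```rust
use kube::{api::PostParams, Api, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Hypothetical, trimmed-down stand-ins for the operator's CRD types.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "materialize.cloud", version = "v1", kind = "Eip", status = "EipStatus", namespaced)]
pub struct EipSpec {}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct EipStatus {
    pub resource_id: Option<String>,
}

// Hypothetical claim helper illustrating the mechanism.
async fn claim_eip(eips: &Api<Eip>, eip_name: &str, resource_id: &str) -> anyhow::Result<Eip> {
    // Read the current EIP, which carries its resourceVersion.
    let mut eip = eips.get(eip_name).await?;
    // Record who is claiming it *before* attempting the actual attachment.
    eip.status.get_or_insert_with(EipStatus::default).resource_id = Some(resource_id.to_owned());
    // replace_status sends the whole object back, resourceVersion included.
    // If anything else modified the EIP since our get, the API server rejects
    // the write with 409 Conflict, so resource_id behaves somewhat like a lock.
    let body = serde_json::to_vec(&eip)?;
    Ok(eips.replace_status(eip_name, &PostParams::default(), body).await?)
}
```

With a plain patch a concurrent writer's changes would be merged silently; the replace makes the race visible so the loser can back off and requeue.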

Alternative to #348

I believe this still has issues:

  • ensuring new eips that get created are associated with the correct resources in a timely manner.
  • the dns target annotation is not removed for pods
  • there is no guarantee that an eip that has lost its resource_id will be assigned a new resource

The main issue with an approach like this:
when a node/pod drops an eip, we're left relying on the pod/node reconciliation cycle to reassign it. The EIP will reconcile due to the status change, but without adding a considerable amount of code as was done in #348, EIPs won't attach themselves to resources or cause resources to re-reconcile and set up the attachment to the eips.
^ Update to this
I've now pushed a change where the eip will use a dynamic resource provider to find all resources matching its selector and update a label with a timestamp, causing those resources to be reconciled. This seems to work really well. Let me know if there are concerns.
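
Roughly what that looks like, as a sketch rather than the PR's exact code: the real change goes through a dynamic resource provider so it covers both nodes and pods, and the label key here is made up.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

use k8s_openapi::api::core::v1::Node;
use kube::{
    api::{ListParams, Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

// Touch every resource matching the EIP's selector so its controller
// re-reconciles and sets up the attachment. Shown for nodes only; the
// label key is hypothetical.
async fn trigger_reconcile(client: Client, selector: &str) -> anyhow::Result<()> {
    let nodes: Api<Node> = Api::all(client);
    let stamp = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs().to_string();
    for node in nodes.list(&ListParams::default().labels(selector)).await?.items {
        let name = node.metadata.name.clone().unwrap_or_default();
        // Updating a label changes the object, which the node/pod controller's
        // watch picks up, forcing a reconcile of that resource.
        let patch = json!({ "metadata": { "labels": { "eip.materialize.cloud/refresh": stamp.as_str() } } });
        nodes.patch(&name, &PatchParams::default(), &Patch::Merge(&patch)).await?;
    }
    Ok(())
}
```

The timestamp value only needs to change so the resource's watch fires; nothing reads it.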

@jubrad requested a review from a team as a code owner on March 15, 2024 05:41
@jubrad force-pushed the fix-contention-issue-simple branch from 10ee2a8 to e2c9b82 on March 15, 2024 05:53
@jubrad force-pushed the fix-contention-issue-simple branch from e2c9b82 to 5699eed on March 15, 2024 17:02
@jubrad force-pushed the fix-contention-issue-simple branch from 5699eed to 484a96e on March 15, 2024 17:38
Err(err)
    if err
        .to_string()
        .contains("Operation cannot be fulfilled on eips.materialize.cloud") =>
Collaborator:

What error type is this? If it is from Kubernetes or something we defined, we should be able to match a more specific type than just doing string matching.

Contributor Author (@jubrad):

I think this is an eip_operator::Error that comes from a kube::Error. I want to say I attempted to match an error, but this was just easier and iirc I might've needed a guard regardless.
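
For reference, matching on the typed error instead of the message would look roughly like this; `Error::Kube` is a made-up name for however eip_operator::Error wraps the underlying kube::Error, and a guard is indeed still needed:

```rust
// Hypothetical sketch: `Error::Kube` stands in for the operator's wrapper
// around kube::Error; `result` is the claim attempt from above.
match result {
    // 409 Conflict from the API server: another resource replaced the EIP
    // status first. Same meaning as the string match, but typed.
    Err(Error::Kube(kube::Error::Api(resp))) if resp.code == 409 => {
        // expected race: log at info level and requeue
    }
    Err(other) => return Err(other),
    Ok(_) => {}
}
```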

"Pod {} failed to claim eip {}, rescheduling to try another",
name, eip_name
);
return Ok(Some(Action::requeue(Duration::from_secs(1))));
Contributor:

Why do we have this special handling at all? If the reconcile loop returns an error, it will already log an error and schedule a re-reconcile in the very near future - I don't quite understand what we gain from handling this manually.

Contributor Author (@jubrad):

If we're expecting this as a valid possibility due to a conflict, and we're able to handle it appropriately, is there really an error? I can just return the error, but I think that makes the logs more confusing than having an info message indicating this sort of thing was expected.

@jubrad force-pushed the fix-contention-issue-simple branch from bd85c36 to 4f7efc7 on March 23, 2024 06:02
@jubrad force-pushed the fix-contention-issue-simple branch 3 times, most recently from 1edfaac to 1c3cb13 on March 26, 2024 03:50
@jubrad force-pushed the fix-contention-issue-simple branch from 1c3cb13 to b3bd1b0 on March 26, 2024 05:55
@alex-hunt-materialize (Collaborator) left a comment:

This looks great! I have one remaining question and a few nitpicks, but all the logic looks correct to me.

s.resource_id.is_none()
    || s.resource_id.as_ref().map(|r| r == &name).unwrap_or(false)
})
});
Collaborator:

How do we handle the migration from older versions of the EIP operator, which don't have resource_ids? Is this OK just because we don't currently have multiple nodes that match?

Contributor Author (@jubrad):

yeah, there could be some re-alignment during this migration if there are multiple nodes matching a single eip, but it shouldn't be any more unstable than the existing implementation.

@alex-hunt-materialize (Collaborator) left a comment:

I left a bunch of nitpicks in my previous review comments, but won't block the PR on those.

@jubrad merged commit 164383c into main on Mar 27, 2024
3 checks passed
@jubrad deleted the fix-contention-issue-simple branch on March 27, 2024 15:05